Read in the
gapminder_clean.csvdata as atibbleusingread_csv.
df <- as_tibble(read.csv("gapminder_clean.csv",
check.names = FALSE,
stringsAsFactors = TRUE
)[, -1]) %>%
mutate_all(~ na_if(., "")) %>%
mutate(Year = as.factor(Year))
datatable(df, caption = "Loaded dataset", options = list(
pageLength = 50, scrollX = "400px"
))
vars <- data.frame(
`Unique values` = sapply(lapply(df, unique), length),
`NA count` = colSums(is.na(df)), Type = sapply(df, class),
check.names = FALSE
)
kable(vars, caption = "Columns (variables) in dataset")
| Unique values | NA count | Type | |
|---|---|---|---|
| Country Name | 263 | 0 | factor |
| Year | 10 | 0 | factor |
| Agriculture, value added (% of GDP) | 1395 | 1179 | numeric |
| CO2 emissions (metric tons per capita) | 2174 | 414 | numeric |
| Domestic credit provided by financial sector (% of GDP) | 1692 | 864 | numeric |
| Electric power consumption (kWh per capita) | 1338 | 1238 | numeric |
| Energy use (kg of oil equivalent per capita) | 1380 | 1197 | numeric |
| Exports of goods and services (% of GDP) | 1771 | 798 | numeric |
| Fertility rate, total (births per woman) | 1939 | 184 | numeric |
| GDP growth (annual %) | 1880 | 691 | numeric |
| Imports of goods and services (% of GDP) | 1771 | 798 | numeric |
| Industry, value added (% of GDP) | 1388 | 1189 | numeric |
| Inflation, GDP deflator (annual %) | 1671 | 706 | numeric |
| Life expectancy at birth, total (years) | 2381 | 189 | numeric |
| Population density (people per sq. km of land area) | 2537 | 49 | numeric |
| Services, etc., value added (% of GDP) | 1388 | 1186 | numeric |
| pop | 1285 | 1323 | numeric |
| continent | 6 | 1323 | factor |
| gdpPercap | 1285 | 1323 | numeric |
There are a total of 19 variables in 2607 rows in the dataset. Most of the variables are continuous, while only Country Name, Year and continent are discrete (All discrete variables will be treated as factors in the remaining of this analysis).
Empty strings were converted to NA values. It can be observed that there is an important number of NA values in the dataset across almost all variables. NA values are later filtered for the purpose of specific analyses.
There is a total of 263 countries across 10 specific years spanning from 1962 to 2007 in increments of 5 (e.g. 1962, 1967, 1972, etc), including 6 continents.
df %>%
dplyr::select(where(is.numeric)) %>%
gather() %>%
filter(!is.na(value)) %>%
ggplot(aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = "free")
Most continuous variables show a somewhat exponential distribution. The exceptions being: Fertility rate, total (births per woman) (somewhat uniform), Life expectancy at birth, total (years) (somewhat normal with a skew), and Services, etc., value added (% of GDP) (normal).
Filter the data to include only rows where
Yearis1962and then make a scatter plot comparing'CO2 emissions (metric tons per capita)'andgdpPercapfor the filtered data.
df_1962 <- df %>%
filter(Year == 1962) %>%
filter(!is.na(gdpPercap) &
!is.na(`CO2 emissions (metric tons per capita)`)) %>%
dplyr::select(
`Country Name`, Year, gdpPercap,
`CO2 emissions (metric tons per capita)`
)
kable(summary(df_1962, maxsum = 0), caption = "Summary of variables")
| Country Name | Year | gdpPercap | CO2 emissions (metric tons per capita) | |
|---|---|---|---|---|
| (Other):108 | (Other):108 | Min. : 355.2 | Min. : 0.00848 | |
| NA | NA | 1st Qu.: 1064.7 | 1st Qu.: 0.15407 | |
| NA | NA | Median : 2510.7 | Median : 0.37623 | |
| NA | NA | Mean : 5173.3 | Mean : 2.35303 | |
| NA | NA | 3rd Qu.: 6157.1 | 3rd Qu.: 2.61142 | |
| NA | NA | Max. :95458.1 | Max. :42.63712 |
After filtering for the desired year, there is corresponding data for 108 countries (removing NA values). gdpPercap ranges from 355.2 to 95,458, and showing a fairly linear relationship with CO2 emissions (metric tons per capita) which ranges from 0.008481 to 42.64.
p <- df_1962 %>%
ggplot(aes(x = gdpPercap, y = `CO2 emissions (metric tons per capita)`)) +
geom_point(aes(text = `Country Name`)) +
labs(title = "CO2 emissions vs GDP", x = "GPD Per Capita") +
theme_bw() +
scale_y_log10() +
scale_x_log10() +
geom_smooth(method = lm, formula = y ~ x)
ggplotly(p)
On the filtered data, calculate the correlation of
'CO2 emissions (metric tons per capita)'andgdpPercap. What is the correlation and associated p value?
result <- cor.test(df_1962$gdpPercap,
df_1962$`CO2 emissions (metric tons per capita)`,
use = "complete.obs"
)
print(result)
##
## Pearson's product-moment correlation
##
## data: df_1962$gdpPercap and df_1962$`CO2 emissions (metric tons per capita)`
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8934697 0.9489792
## sample estimates:
## cor
## 0.9260817
The correlation between said variables is 0.9261 with a p-value of 1.129^{-46} (highly significant), meaning the data is well correlated.
On the unfiltered data, answer “In what year is the correlation between
'CO2 emissions (metric tons per capita)'andgdpPercapthe strongest?” Filter the dataset to that year for the next step…
corr_by_year <- df %>%
group_by(Year) %>%
summarize(Correlation = cor(gdpPercap,
`CO2 emissions (metric tons per capita)`,
use = "complete.obs"
)) %>%
arrange(desc(Correlation))
kable(corr_by_year, caption = "Correlation by year")
| Year | Correlation |
|---|---|
| 1967 | 0.9387918 |
| 1962 | 0.9260817 |
| 1972 | 0.8428986 |
| 1982 | 0.8166384 |
| 1987 | 0.8095531 |
| 1992 | 0.8094316 |
| 1997 | 0.8081396 |
| 2002 | 0.8006421 |
| 1977 | 0.7928336 |
| 2007 | 0.7204169 |
Correlation between 'CO2 emissions (metric tons per capita)' and gdpPercap is the strongest on year 1967.
Using
plotly, create an interactive scatter plot comparing'CO2 emissions (metric tons per capita)'andgdpPercap, where the point size is determined bypop(population) and the color is determined by thecontinent. You can easily convert anyggplotplot to aplotlyplot using theggplotly()command.
df_1967 <- df %>%
filter(Year == 1967) %>%
filter(!is.na(gdpPercap) &
!is.na(`CO2 emissions (metric tons per capita)`)) %>%
dplyr::select(
`Country Name`, Year, gdpPercap,
`CO2 emissions (metric tons per capita)`, continent, pop
)
kable(summary(df_1967, maxsum = 0), caption = "Summary of variables")
| Country Name | Year | gdpPercap | CO2 emissions (metric tons per capita) | continent | pop | |
|---|---|---|---|---|---|---|
| (Other):113 | (Other):113 | Min. : 349 | Min. : 0.0118 | (Other):113 | Min. : 70787 | |
| NA | NA | 1st Qu.: 1136 | 1st Qu.: 0.1813 | NA | 1st Qu.: 2427334 | |
| NA | NA | Median : 2742 | Median : 0.5749 | NA | Median : 5212416 | |
| NA | NA | Mean : 5768 | Mean : 2.7235 | NA | Mean : 25359863 | |
| NA | NA | 3rd Qu.: 7656 | 3rd Qu.: 4.4315 | NA | 3rd Qu.: 12607312 | |
| NA | NA | Max. :80895 | Max. :43.4283 | NA | Max. :754550000 |
p <- df_1967 %>%
ggplot(aes(x = gdpPercap, y = `CO2 emissions (metric tons per capita)`)) +
geom_point(aes(text = `Country Name`, col = continent, size = pop)) +
labs(title = "CO2 emissions vs GDP", x = "GPD Per Capita") +
theme_bw() +
scale_y_log10() +
scale_x_log10() +
geom_smooth(method = lm, formula = y ~ x)
ggplotly(p)
Corresponding data exist for 113 countries for year 1967. The relationship between 'CO2 emissions (metric tons per capita)' and gdpPercap remains highly linear.
What is the relationship between
continentand'Energy use (kg of oil equivalent per capita)'? (stats test needed)
df_continent <- df %>%
filter(!is.na(continent) &
!is.na(`Energy use (kg of oil equivalent per capita)`)) %>%
dplyr::select(
continent,
`Energy use (kg of oil equivalent per capita)`, pop,
`Country Name`
)
kable(summary(df_continent, maxsum = 6)[, 1:3],
caption = "Summary of variables"
)
| continent | Energy use (kg of oil equivalent per capita) | pop | |
|---|---|---|---|
| : 0 | Min. : 9.715 | Min. :1.821e+05 | |
| Africa :199 | 1st Qu.: 484.129 | 1st Qu.:4.437e+06 | |
| Americas:188 | Median : 995.564 | Median :1.015e+07 | |
| Asia :185 | Mean : 1992.607 | Mean :4.389e+07 | |
| Europe :256 | 3rd Qu.: 2895.659 | 3rd Qu.:3.155e+07 | |
| Oceania : 20 | Max. :14746.031 | Max. :1.319e+09 |
There is a total of 848 rows without NA values. Observation count is fairly balanced between continents with the exception of Oceania, where there is data for only 20 rows.
model_base <- lm(
`Energy use (kg of oil equivalent per capita)` ~ continent,
df_continent
)
model_base_summary <- summary(model_base)
kable(tidy(model_base), caption = "Linear regression terms")
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.5168 | 137.2382 | 5.089811 | 4e-07 |
| continentAmericas | 1005.1037 | 196.9027 | 5.104570 | 4e-07 |
| continentAsia | 1168.7636 | 197.7220 | 5.911147 | 0e+00 |
| continentEurope | 2447.5453 | 182.9620 | 13.377343 | 0e+00 |
| continentOceania | 3281.7976 | 454.1321 | 7.226526 | 0e+00 |
kable(glance(model_base), caption = "Linear regression statistics")
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1962523 | 0.1924386 | 1935.984 | 51.45916 | 0 | 4 | -7618.731 | 15249.46 | 15277.92 | 3159591816 | 843 | 848 |
A linear regression for this data set shows that there are statistically significant differences across all continents (p-values < 0.05) in energy use. Being Oceania the continent with the highest consumption, and the lowest being Africa (which is the base, i.e. just intercept value). The regression coefficients show that, for example Asia consumes, in average, 1169 kg of oil equivalent yearly per person more than Africa, and 164 kg (\(1169-1005\)) more than the Americas.
Overall, adjusted R-squared is 0.1924386 meaning that 19.24% of observed variance in 'Energy use (kg of oil equivalent per capita)' is explained by continent.
plot(model_base)
Looking at the diagnostic plots, the Residuals vs Fitted shows a somewhat horizontal line (with a small valley), hinting that the relationship between the variables is fairly linear but variables could use transformation (explored below), this is supported by the observed separation (points vs dotted line) at the right side of the Normal Q-Q plot.
The Scale-Location plot hints heteroscedasticity (non constant variance), which is not surprising because all the others variables in the data set are left out, including a time variable (Year), which could hint that the mean of 'Energy use (kg of oil equivalent per capita)' does not remain constant over time (which is to be expected).
Finally Residuals vs Leverage plot does not show points beyond Cook’s distance, hinting that outliers are not to be expected (although there is a set of observations with high leverage, which could be further explored).
model_log <- lm(log10(`Energy use (kg of oil equivalent per capita)`) ~
continent, df_continent)
model_log_summary <- summary(model_log)
kable(tidy(model_base),
caption = "Linear regression (log transformed) terms"
)
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 698.5168 | 137.2382 | 5.089811 | 4e-07 |
| continentAmericas | 1005.1037 | 196.9027 | 5.104570 | 4e-07 |
| continentAsia | 1168.7636 | 197.7220 | 5.911147 | 0e+00 |
| continentEurope | 2447.5453 | 182.9620 | 13.377343 | 0e+00 |
| continentOceania | 3281.7976 | 454.1321 | 7.226526 | 0e+00 |
kable(glance(model_base),
caption = "Linear regression (log transformed) statistics"
)
| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.1962523 | 0.1924386 | 1935.984 | 51.45916 | 0 | 4 | -7618.731 | 15249.46 | 15277.92 | 3159591816 | 843 | 848 |
In this case, the only variable that could be transformed is Energy use (kg of oil equivalent per capita), as continent is discrete.
Log transforming energy use improved the fit, yielding an adjusted R-squared of 0.3552871 (a significant increase from 0.1924386).
Overall, this model shows benefits from log transforming Energy use (kg of oil equivalent per capita) as more variance can be explained by continent.
While interpreting the coefficients, as shown below, for the log transformed model it is important to take into account that the unit is in log10 scale, meaning, for example in the case of Oceania, that the log10 of Energy use (kg of oil equivalent per capita) increases by 0.85552 from its base (Africa/the intercept) when the country is in Oceania. As an example, an estimation of the energy use when the country is in Oceania would be: \(10^{(2.72558+0.85552)}\)
plot(model_log)
The better fit is also evidenced by improvement in the diagnostic plots. As can be seen in Residuals vs Fitted showing a more horizontal line with a smaller peak than previously seen, and Normal Q-Q plot showing that the right tail of points follow the dotted line closer.
Scale-Location and Residuals vs Leverage remain similar to what was previously seen.
Non linear relationships were left out of this analysis.
df_continent <- df %>%
filter(!is.na(continent) &
!is.na(`Energy use (kg of oil equivalent per capita)`))
model_aov <- aov(log10(`Energy use (kg of oil equivalent per capita)`) ~
continent, df_continent)
aov_summary <- summary(model_aov)
kable(tidy(model_aov), caption = "ANOVA terms")
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| continent | 4 | 64.65929 | 16.1648216 | 117.6908 | 0 |
| Residuals | 843 | 115.78600 | 0.1373499 | NA | NA |
Not surprisingly, ANOVA confirms that there are statistical differences between continents for the variable in question, showing a very significant p-value of 9.1902822^{-80} (the interpretation of p-value is explained in more detail in the next analysis).
p <- df_continent %>%
ggplot(aes(
x = continent,
y = `Energy use (kg of oil equivalent per capita)`
)) +
geom_boxplot(aes(col = continent)) +
labs(title = "Energy use vs continent", x = "GPD Per Capita") +
scale_y_log10()
ggplotly(p)
Differences in energy use between continents are evidenced visually in the box plot above, showing different Y-axis locations across continents. Africa is the continent with the smallest energy use, followed by Asia and the Americas which are similar among each other, while Europe is most similar with Oceania (albeit Oceania shows less dispersion, driven mostly by its limited number of observations).
It is to be noted that the countries with the max consumption of energy is seen in America and Europe, while the minimum is seen in Africa.
Is there a significant difference between Europe and Asia with respect to
'Imports of goods and services (% of GDP)'in the years after 1990? (stats test needed)
df_1990 <- df %>%
filter(as.numeric(as.character(Year)) > 1990 &
continent %in% c("Europe", "Asia") &
!is.na(`Imports of goods and services (% of GDP)`)) %>%
dplyr::select(continent, `Imports of goods and services (% of GDP)`, Year)
kable(summary(df_1990), caption = "Summary of variables")
| continent | Imports of goods and services (% of GDP) | Year | |
|---|---|---|---|
| : 0 | Min. : 0.07951 | 2002 :56 | |
| Africa : 0 | 1st Qu.: 27.71081 | 2007 :56 | |
| Americas: 0 | Median : 38.77832 | 1997 :53 | |
| Asia : 98 | Mean : 44.12648 | 1992 :47 | |
| Europe :114 | 3rd Qu.: 56.12958 | 1962 : 0 | |
| Oceania : 0 | Max. :183.91553 | 1967 : 0 | |
| NA | NA | (Other): 0 |
Complete corresponding data exists for 212 rows after 1990.
anova <- aov(`Imports of goods and services (% of GDP)` ~ continent, df_1990)
aov_summary <- summary(anova)
kable(tidy(model_aov), caption = "ANOVA terms")
| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| continent | 4 | 64.65929 | 16.1648216 | 117.6908 | 0 |
| Residuals | 843 | 115.78600 | 0.1373499 | NA | NA |
An ANOVA test does not show significant difference between Europe and Asia with respect to 'Imports of goods and services (% of GDP)' with a p-value of 0.1575197 for the F statistic. Meaning there is a 15.75% chance that the observed variation is due to random chance, which is usually not enough to reject the null hypothesis (that the samples come from the same population). P-value should be at least lower than 0.05 to reject the null hypothesis with a 95% confidence interval.
p <- df_1990 %>%
ggplot(aes(
x = continent,
y = `Imports of goods and services (% of GDP)`, col = continent
)) +
geom_boxplot()
ggplotly(p)
The box plots visually confirms there are not statistical differences. Both samples have a similar mean and somewhat similar interquartile range, although dispersion is higher in Asia.
What is the country (or countries) that has the highest
'Population density (people per sq. km of land area)'across all years? (i.e., which country has the highest average ranking in this category across each time point in the dataset?)
df_pop <- df %>%
filter(!is.na(`Population density (people per sq. km of land area)`)) %>%
group_by(`Country Name`) %>%
summarize(
AvgPopDensity =
mean(`Population density (people per sq. km of land area)`)
) %>%
arrange(desc(AvgPopDensity))
kable(summary(df_pop, maxsum = 0), caption = "Summary of variables")
| Country Name | AvgPopDensity | |
|---|---|---|
| (Other):262 | Min. : 0.139 | |
| NA | 1st Qu.: 20.014 | |
| NA | Median : 47.132 | |
| NA | Mean : 261.664 | |
| NA | 3rd Qu.: 121.271 | |
| NA | Max. :14732.035 |
df_pop_top <- df_pop %>%
top_n(5, wt = AvgPopDensity)
p <- df %>%
filter(`Country Name` %in% df_pop_top$`Country Name`) %>%
ggplot(aes(
y = `Population density (people per sq. km of land area)`,
x = reorder(
`Country Name`,
-`Population density (people per sq. km of land area)`
),
col = `Country Name`
)) +
geom_boxplot() +
labs(title = "Top 5 countries by pop. density", x = "Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
top_countries <- paste(df_pop_top[
1:2,
"Country Name"
]$`Country Name`, sep = ", ")
ggplotly(p)
Average population density varies greatly (0.1394 to 14,732). The top 2 countries with the highest 'Population density (people per sq. km of land area)' across all years are Macao SAR, China, Monaco.
What country (or countries) has shown the greatest increase in
'Life expectancy at birth, total (years)'since 1962?
df_life_incr <- df %>%
group_by(`Country Name`) %>%
filter(!is.na(`Life expectancy at birth, total (years)`)) %>%
mutate(
`Life expectancy 1962` =
`Life expectancy at birth, total (years)`[match(1962, Year)],
.keep = "all"
) %>%
filter(!is.na(`Life expectancy 1962`)) %>%
filter(Year == 2007) %>%
mutate(
Increase_LE =
`Life expectancy at birth, total (years)` /
`Life expectancy 1962`
) %>%
dplyr::select(
continent, `Country Name`, `Life expectancy 1962`,
`Life expectancy at birth, total (years)`, Increase_LE
) %>%
arrange(desc(Increase_LE)) %>%
ungroup()
kable(summary(df_life_incr), caption = "Summary of variables")
| continent | Country Name | Life expectancy 1962 | Life expectancy at birth, total (years) | Increase_LE | |
|---|---|---|---|---|---|
| : 0 | Afghanistan : 1 | Min. :28.55 | Min. :44.18 | Min. :0.8451 | |
| Africa : 47 | Albania : 1 | 1st Qu.:44.82 | 1st Qu.:62.03 | 1st Qu.:1.1477 | |
| Americas: 24 | Algeria : 1 | Median :54.29 | Median :71.34 | Median :1.2569 | |
| Asia : 25 | Angola : 1 | Mean :54.26 | Mean :68.74 | Mean :1.2995 | |
| Europe : 29 | Antigua and Barbuda: 1 | 3rd Qu.:64.73 | 3rd Qu.:75.15 | 3rd Qu.:1.4345 | |
| Oceania : 2 | Arab World : 1 | Max. :73.72 | Max. :82.51 | Max. :2.0032 | |
| NA’s :109 | (Other) :230 | NA | NA | NA |
In average, all countries increased life expectancy by ~30% from 1962 to 2007 (shown as a blue dotted line in the plot), however the top countries in this ranking yielded a ~100% (or 2x) increase. Continent data was incomplete so that was left out of the analysis.
The plot below shows the distribution of life expectancy increase (Increase_LE) across countries. Something to notice is that after the top 11th country (China) the rate of decrease starts slowing down (showing an elbow in the plot). For this reason, the top 11 countries were selected as the answer to the question presented, and plotted as a bar chart at the end.
p <- df_life_incr %>%
ggplot(aes(y = Increase_LE, x = reorder(`Country Name`, -Increase_LE))) +
geom_point() +
labs(title = "Countries by life expectancy increase since 1962") +
geom_hline(
yintercept = mean(df_life_incr$Increase_LE),
linetype = "dotted", color = "blue"
) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
ggplotly(p)
p <- df_life_incr %>%
top_n(11, wt = Increase_LE) %>%
ggplot(aes(
y = Increase_LE, x = reorder(`Country Name`, -Increase_LE),
fill = `Country Name`
)) +
geom_bar(stat = "identity") +
labs(
title =
"Top countries by life expectancy increase since 1962",
x = "Country"
) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplotly(p)